Abstract

Calibration is a basic property for prediction systems, and algorithms 
for achieving it are well-studied in both statistics and machine learning. 
In many applications, however, the predictions are used to make decisions 
that select which observations are made. This makes calibration difficult, 
as adjusting predictions to achieve calibration changes future data. We 
focus on click-through-rate (CTR) prediction for search ad auctions. Here, 
CTR predictions are used by an auction that determines which ads are 
shown, and we want to maximize the value generated by the auction. 

We show that certain natural notions of calibration can be impossible 
to achieve, depending on the details of the auction. We also show that 
it can be impossible to maximize auction efficiency while using calibrated 
predictions. Finally, we give conditions under which calibration is achiev- 
able and simultaneously maximizes auction efficiency: roughly speaking, 
bids and queries must not contain information about CTRs that is not 
already captured by the predictions. 



1 Introduction 

Calibration is a fundamental measure of accuracy in prediction problems: if 
we group all the events a predictor says happen with probability p, about a 
p fraction should occur. This property has been extensively studied in the 
stochastic and online settings. 

We study problems where the predictions themselves partially determine 
which events occur. Our general approach applies to many problems where 
predictions are used to make decisions, but we are motivated in particular by the 
application to search engine advertising. Over the past decade, this business has 
grown to tens of billions of dollars, and prediction systems play a fundamental 
role. 

In a typical interaction, a user first issues a query (say "flowers") on a search
engine. Then, the search engine selects a set of candidate ads that can be
shown on the given query, based on keywords provided by advertisers. These 
components can be reasonably approximated by an IID process. A prediction
is made for each candidate ad, and an auction ranks the ads based on the 
prediction and the bid of the advertiser. Typically, the bid indicates the value 
of a click to the advertiser, and the score is simply the product of the bid and the 
prediction, giving an estimate of the value generated by showing the ad. Finally, 
some of these ads are shown to the user (we consider two models: the single 
top-ranked ad is shown, or all the ads with scores above a certain threshold are 
shown). This auction selection mechanism has been extensively studied, and 
has many nice properties.



In this setting, an important measure of the quality of the predictions is how 
much value the auction generates (equivalently, how efficient are the allocations 
produced by the auction). The auction mechanisms we consider are in fact 
designed to maximize the combined value to the search engine and advertiser if 
bids accurately reflect value and the true click-through-rates (CTRs) are known. 

The algorithm used to predict CTRs for such a system already faces many
constraints, for example, the need to process enormous volumes of data quickly
and produce predictions with extremely low latency. Thus, rather
than advocating new algorithms, we focus on applying a post-correction via a 
prediction map to the outputs of an existing system in order to improve the 
quality of the predictions. 

We consider two main questions. Informally stated: 1) Do efficiency-maximizing
prediction maps with calibration properties exist, and can they be found
computationally efficiently? 2) If we iteratively calibrate our predictions so they
match observed CTRs, does the process converge? And if so, is this prediction 
map efficiency maximizing? 



Outline and Summary of Results We formalize our model and questions in
Section 2, where we introduce two primary variants of the selection mechanism
that lead to different properties; Sections 3 and 4 investigate these mechanisms
in the general case. We demonstrate that without further assumptions, in both
our models it may be impossible for a deterministic prediction map to produce
calibrated predictions on the ads it serves, and iterative calibration procedures
can fail badly. Since some deterministic map always maximizes value, this is
unfortunate. When all ads above a certain threshold are shown, we give an
algorithm for finding this value-maximizing map in polynomial time, but when
the single highest-rated ad is shown, we prove finding the value-maximizing map
is NP-hard (even if we knew the true CTRs).

In Section 5 we introduce additional assumptions that are sufficient to
guarantee calibration procedures are well-behaved. While these assumptions are
fairly strong, they are not unreasonable for real systems. Our strongest
assumption is essentially that in all cases bid and query provide no more
information than the raw prediction about average CTRs; under this assumption, we
can show in both selection models a value-maximizing and calibrated prediction
map exists. Under threshold selection, somewhat weaker conditions are in fact
sufficient.
Related Work Calibration has been extensively studied. Much of the earliest
work is in the probabilistic forecasting literature. Calibration is
particularly important when comparing predictors, since two sets of calibrated
predictions can be fairly evaluated by how concentrated they are on observed
outcomes. Calibration also makes it easier to use predictions. For
example, it is easier to threshold the output of a calibrated classifier to minimize
weighted classification error.

Not all prediction systems are naturally calibrated. However, when examples
are drawn IID, if we have a good but uncalibrated predictor, we can calibrate
it by applying a prediction map. For example, boosted trees are uncalibrated,
but become excellent probability estimators after calibration. The two
most common methods for calibration are Platt scaling, which is equivalent to
logistic regression, and isotonic regression.



Calibration is also studied in the online setting, where no stochastic assump- 
tions are made on the sequence of examples; in the worst case, they could be 
chosen by an adversary that sees our predictions. It is easy to see that in this 
setting, no deterministic classifier (or prediction map) can produce calibrated 
predictions for all sequences. However, if the system is allowed to use
randomness (that is, predict a distribution), then calibration can be achieved.



2 Problem Formalization 

The interaction of calibration and selection has received little direct attention in 
the literature, so constructing a suitable model requires some care: we require 
a formulation that is theoretically tractable but still captures the key charac- 
teristics of the real-world problems of interest. 

We begin by defining our units of prediction (queries and ads) and the mech- 
anism used to select them (auctions). We assume a fixed, existing prediction 
system provides a raw prediction for each ad; our study will then concern predic- 
tion maps, functions that attempt to map these raw predictions to calibrated 
probabilities. Once this framework is established, we can formally state the 
questions we study. 

We model the interaction between a search engine's users and advertising
system. There is a fixed finite set of queries $Q$ (strings like "flowers" or "car
insurance" typed into the search engine), which are chosen according to
distribution $\Pr_Q(q)$ for $q \in Q$. There is also a fixed finite set of ads $\mathcal{C}$ which can be
shown alongside queries. Each ad $i \in \mathcal{C}$ is defined by a tuple $(p_i, b_i, z_i, q_i)$ where
$q_i \in Q$ is the (only) query for which ad $i$ can show (see footnote 1), $p_i$ is the true probability
of a click, $b_i$ is the bid (the maximum amount the advertiser is willing to pay
for a click), and $z_i \in \{1, \dots, K\}$ is a bucketed estimate of $p_i$ (we call $z_i$ the raw
prediction). That is, we assume the predictions of the underlying prediction
system have been discretized into $K$ buckets. We drop the $q$ (and sometimes $z$)

1 This is without loss of generality, as we can always replicate ads for each query to which 
the advertiser has targeted the ad. 






from the ad tuples when those values are clear from context. Each ad can show
for a single query $q$, so we define $\mathcal{C}(q) = \{i \mid q_i = q\}$, the indexes of the candidate
ads for query $q$.

Our goal is to find good prediction maps $f : \{1, \dots, K\} \to [0, 1]$. The
prediction map will be used in the auction selection mechanism: First, a query
is sampled from $\Pr_Q$, and then the candidate ads for that query are ranked by
$b \cdot f(z)$ (we drop the subscripts when we mean an arbitrary ad). We consider
two models for which ads show:

ONE: We only show a single ad. If multiple ads achieve the highest value of
$b \cdot f(z)$, we pick one uniformly at random.

ALL: We show all ads where $b \cdot f(z) - 1 \ge 0$.

Mechanism ONE models the case of an oversold auction, where ads with different
raw predictions $z$ must compete for a single position. Mechanism ALL models
the case where all eligible ads with non-negative predicted value can be shown. In
general, mechanism ALL is much easier to work with theoretically, because for
$z_1 \neq z_2$, changing $f(z_1)$ does not change which ads with prediction $z_2$ are shown.
In either case, we assume any candidate $(p, b, z)$ which is shown is clicked with
probability $p$ (see footnote 2).

Distributions on Ads Other than the distribution $\Pr_Q$, all probabilities and
expectations will be with respect to some distribution on the set of candidate
ads $\mathcal{C}$. Two distributions will be of particular importance: $\Pr_C$, the uniform
distribution over candidate ads, and $\Pr_f$, the distribution of ads shown by a
prediction map $f$. We formalize these as follows:

$\Pr_C$ is the distribution on ads where $\Pr_C(i)$ is proportional to $\Pr_Q(q_i)$. That
is, letting $C = \sum_{i \in \mathcal{C}} \Pr_Q(q_i)$, we have $\Pr_C(i) = \Pr_Q(q_i)/C$. This is not the same as
choosing a random query $q$ from $\Pr_Q$ and then choosing a random candidate.
For example, suppose there are two queries $q_1$ and $q_2$, with $\Pr_Q(q_1) = \frac{1}{2}$ and
$\Pr_Q(q_2) = \frac{1}{2}$. There is one candidate $a_1$ for query $q_1$, and two candidates,
$a_2$ and $a_3$, for query $q_2$. Then $\Pr_C(a_i) = 1/3$ for each ad, which means the
marginal probabilities are $\Pr_C(q_1) = \frac{1}{3}$ and $\Pr_C(q_2) = \frac{2}{3}$. One can think of $\Pr_C$ as
the distribution on ads shown if we showed all the eligible candidates for each
query that occurs.
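The two-query example above can be checked numerically. The following is a minimal sketch; the function name `ad_distribution` and the dictionary layout are illustrative assumptions, not notation from the paper.

```python
from fractions import Fraction

def ad_distribution(query_prob, ad_query):
    """Return Pr_C: each ad is weighted by Pr_Q of its query, then normalized."""
    total = sum(query_prob[q] for q in ad_query.values())  # the constant C
    return {ad: query_prob[q] / total for ad, q in ad_query.items()}

# The example from the text: one candidate on q1, two candidates on q2.
query_prob = {"q1": Fraction(1, 2), "q2": Fraction(1, 2)}
ad_query = {"a1": "q1", "a2": "q2", "a3": "q2"}

pr_c = ad_distribution(query_prob, ad_query)

# Marginal probability of each query under Pr_C (not the same as Pr_Q).
marginal = {
    q: sum(w for ad, w in pr_c.items() if ad_query[ad] == q)
    for q in query_prob
}
print(pr_c, marginal)
```

Here every ad receives weight 1/3, so the query with two candidates carries twice the marginal mass of the query with one, even though both queries are equally likely under $\Pr_Q$.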

$\Pr_f$ for a prediction map $f$ is the distribution on ads where $\Pr_f(i)$ is
proportional to $w_i = \Pr_Q(q_i) \Pr(\text{ad } i \text{ shows} \mid q_i, f)$. The second term is actually
only random in the case of selection mechanism ONE, when randomness is used
to break ties. The distribution $\Pr_f$ is thus the distribution on ads shown when
serving using prediction map $f$. Using this notation, $\Pr_f(i \mid q) = \Pr(\text{ad } i \text{ shows} \mid q_i, f)$.

We use $\mathbb{E}_C[\cdot]$ and $\mathbb{E}_f[\cdot]$ for the corresponding expectations.

2 This ignores the well-known issue of position normalization; this aspect of the problem is 
largely orthogonal to our work. 
Calibration We say a prediction map $f$ is calibrated on a distribution on ads
$D$ if

$$\forall z, \quad \underbrace{\mathbb{E}_{(p,b,z,q) \sim D}[p \mid z]}_{\text{Average CTR given } z} = \underbrace{f(z)}_{\text{Predicted CTR given } z}.$$

The choice of the distribution $D$ in the above definition is critical; a single $f$
will in general not be able to achieve calibration for multiple $D$. For the auction
selection problem, the natural distribution to consider is $\Pr_f$. Thus, we will
be particularly concerned with finding self-calibrated prediction maps $f$, which
satisfy

$$\forall z, \quad \mathbb{E}_f[p \mid z] = f(z).$$

In general one may not be able to estimate $\mathbb{E}_f[p \mid z]$ exactly, and so
calibration will only be approximately achievable. This issue is orthogonal to our
results, so we assume that the necessary expected quantities can be estimated
exactly. Thus, we emphasize that our negative results are a fundamental
limitation, rather than a byproduct of insufficient data.

Auction Efficiency In addition to calibration, we are concerned with how
the choice of $f$ impacts the auction mechanism. The expected value of showing
ad $(p, b)$ is $p \cdot b - \text{cost}$, where we take $\text{cost} = 1$ for selection mechanism ALL, and
$\text{cost} = 0$ for ONE. We assume the bid $b$ reflects the true value to the advertiser of
a click, which is justified by the incentives of the auction under a suitable
pricing scheme [19]. The cost can be viewed as the cost per impression of showing
the ad (either a cost incurred by the user doing the query or incurred by the
search engine itself). In practice such costs might be different for clicked versus
unclicked ad impressions, and might vary depending on the ad and query.
Extending our results to such models would add a significant notational burden,
so we focus on the simplest interesting cost models.
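For a given query $q$, the expected value generated is

$$\sum_{i \in \mathcal{C}(q)} \Pr(\text{ad } i \text{ shows} \mid f, q)\,(p_i b_i - \text{cost}).$$

The expected value per query is just

$$\begin{aligned}
\mathrm{EV}(f) &= \sum_{q \in Q} \Pr_Q(q) \sum_{i \in \mathcal{C}(q)} \Pr_f(i \mid q)\,(p_i b_i - \text{cost}) \\
&= \sum_{i \in \mathcal{C}} w_i\,(p_i b_i - \text{cost}).
\end{aligned}$$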

We say an $f^* \in \arg\max_f \mathrm{EV}(f)$ is efficiency maximizing. Our goal is to find an
$f$ that transforms the $z$ into the best possible predictions in terms of efficiency.
Note that if it were possible to predict exactly $p_i$ for ad $i$, these predictions would
maximize efficiency.
Questions Ideally, we would like to use prediction maps that are self-calibrated
and efficiency-maximizing; we say such prediction maps are nice, and say a
problem instance is nice if such a map exists.

First, we consider questions relating to the offline problem where we have
access to all the problem data. Note that there must exist an efficiency-maximizing
prediction map (see footnote 3).

Q1 Are all problem instances nice? That is, do self-calibrated
efficiency-maximizing prediction maps always exist?

Q2 Can an efficiency-maximizing prediction map, even one that is not
self-calibrated, be found in polynomial time?

In practice, we are further concerned with learning a good prediction map
from observed data. Suppose we start with some $f_0$, for example the function
that gives the predictions of the underlying system. Then, we serve some large
number of queries with this $f_0$, and observe the results. We would like to then
train an improved $f_1$ from this data, serve another large batch of queries ranked
using $f_1$, then train an $f_2$, etc.

A natural procedure is to choose $f_t$ so that the predictions on the ads shown
in batch $t - 1$ would have been calibrated under $f_t$. Of course, when we then
select ads using $f_t$ on the next batch, we may show different ads. Formally,
define $T : [0, 1]^K \to [0, 1]^K$ (a function from prediction maps to prediction
maps) by $T(f) = f'$ where

$$f'(z) = \mathbb{E}_f[p \mid z],$$

with $f'(z) = f(z)$ for any $z$ for which no ads are shown under $f$ (so that the
expectation is undefined).
We assume we have enough data in each batch so that we can calculate $\mathbb{E}_{f_{t-1}}[p \mid z]$
exactly. Then, we ask:

Q3 Does $T$ always have at most a small (polynomial) number of fixed points?

Q4 Does $T$ always have at least one fixed point where ads are shown?

Q3 is important, because with an affirmative answer we could potentially
enumerate the fixed points and find the best one from an efficiency perspective. A
negative answer to Q4 implies the iterative calibration procedure will cycle. To
see this, note that for a given starting point $f_0$, subsequent $f_t(z)$ can only take
on finitely many values: $\mathbb{E}[p \mid z]$ for some distribution of ads that show (finitely
many values), or $f_0(z)$. That means that $T$ maps some finite set of prediction
maps into itself. Since $T$ has no fixed points on this set, the iterates can never
settle on a single map, and so they must eventually enter a cycle of length at
least two.

In the next two sections, we address these questions in the general case 
(putting no additional restrictions on the problem instances). 

3 Note EV depends only on the ordering of the ads for each query induced by $f$, and so
over all possible $f$, EV takes on only a finite number of distinct values.
Figure 1: An example with 5 fixed points, one for each prefix of the list of ads.
For each i, setting p to the value in the "min p" column induces a fixed point 
where ads 1, . . . , i show. The fifth fixed point is the degenerate one that shows 
no ads, with say p = 0. 

3 Mechanism ALL: Threshold Selection 

In this section, we consider the case where we select ads by mechanism ALL,
that is, we show all ads where $b \cdot f(z) - 1 \ge 0$.

We will show that an efficiency-maximizing prediction map can be found
efficiently (Q2), but without further assumptions, Q1, Q3, and Q4 are answered
in the negative. We prove the negative results first; for this purpose, it is
sufficient to construct counter-examples.

In this section, the examples we construct all require only a single query
where all of the candidates have the same raw prediction $z$. Thus, choosing a
prediction map reduces to choosing a single value $\hat{p} \in [0, 1]$. The selection rule
simply shows all candidates where $b \cdot f(z) = b \cdot \hat{p} \ge 1$.

Q1: All fixed points can have bad efficiency Consider an example with
$2n + 1$ candidate ads, divided into three classes, with ads given as $(p, b)$ tuples
and a single prediction value $\hat{p} = f(z)$:

A) 1 ad is $(0.5, 2.0)$, shown if $\hat{p} \ge 0.5$

B) $n$ ads are $(1, 1.9)$, shown if $\hat{p} \ge 1/1.9 \approx 0.53$

C) $n$ ads are $(0, 1.8)$, shown if $\hat{p} \ge 1/1.8 \approx 0.56$

We either show no ads, A, A + B, or A + B + C. Choosing $\hat{p} = 0.5$ is a fixed
point (it only shows the first ad) which generates value $0.5 \cdot 2 - 1 = 0$. Using
$\hat{p} = 0.54$ shows A + B, and generates value $0.9n$. But this is not a fixed point:
the observed CTR is near one (for large $n$). Showing all the ads (which occurs
for any $\hat{p} \ge 1/1.8$) is not a fixed point, since the observed CTR is then
$(n + 0.5)/(2n + 1) = 0.5$, and it generates negative value, since ads
from class C generate value $-n$.
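The arithmetic in this example can be reproduced with a short script. This is a sketch under the stated setup (single query, single raw prediction, threshold rule $b \cdot \hat{p} \ge 1$); the helper names and the choice $n = 10$ are illustrative assumptions.

```python
from fractions import Fraction as F

def shown(ads, p_hat):
    """Mechanism ALL with one raw prediction: show ads with b * p_hat >= 1."""
    return [(p, b) for (p, b) in ads if b * p_hat >= 1]

def value(ads, p_hat):
    """Total value of the shown ads, each contributing p * b - 1."""
    return sum(p * b - 1 for (p, b) in shown(ads, p_hat))

def observed_ctr(ads, p_hat):
    """Average true CTR of the shown ads (None if nothing is shown)."""
    s = shown(ads, p_hat)
    return sum(p for p, _ in s) / len(s) if s else None

n = 10  # illustrative size; the text treats n as large
ads = (
    [(F(1, 2), F(2))]               # class A
    + [(F(1), F(19, 10))] * n       # class B
    + [(F(0), F(18, 10))] * n       # class C
)

for p_hat in (F(1, 2), F(54, 100), F(56, 100)):
    ctr = observed_ctr(ads, p_hat)
    print(p_hat, value(ads, p_hat), ctr, "fixed point" if ctr == p_hat else "")
```

Only $\hat{p} = 0.5$ reproduces itself as the observed CTR, and it yields value 0, while the non-fixed point $\hat{p} = 0.54$ yields $0.9n$.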

Q3: An example with exponentially many fixed points Suppose there
are $n$ candidates $(p_i, b_i)$ where the $p_i$ are distinct, and we have indexed by
$i$ so that $p_i$ is strictly increasing. Further, suppose $b_i = 1/\bar{p}_{1:i}$, a decreasing
sequence (using the shorthand $\bar{p}_{1:i} = \frac{1}{i}\sum_{j=1}^{i} p_j$). Pick any $i \in \{1, \dots, n\}$, and let
the prediction be $\hat{p} = \bar{p}_{1:i}$. We show candidate $j$ if $b_j \cdot \hat{p} = \bar{p}_{1:i}/\bar{p}_{1:j} \ge 1$.
Since the bids are decreasing,
we show candidate $j$ if and only if $j \le i$. Thus, serving with $\hat{p} = \bar{p}_{1:i} = 1/b_i$ we
show candidates $1, \dots, i$, and so the average CTR is in fact $\hat{p}$. Thus, for any
$i \in \{1, \dots, n\}$, there is a fixed point $\hat{p}$ that shows ads $\{1, \dots, i\}$. Figure 1 shows
an example of this construction. If we have $m$ queries each with a distinct fixed
raw prediction $z$ and $n$ candidates constructed in this manner, we can choose a
per-query fixed point independently for each query, for $n^m$ distinct fixed points.
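The construction can be verified on a small instance. The sketch below assumes the bids are the reciprocals of the running averages of the true CTRs, as in the construction above; the four CTR values are made up for illustration.

```python
from fractions import Fraction as F

def prefix_means(ps):
    """Running averages: mean(p_1..p_i) for each i."""
    means, total = [], F(0)
    for i, p in enumerate(ps, start=1):
        total += p
        means.append(total / i)
    return means

# Strictly increasing true CTRs (illustrative values).
ps = [F(1, 10), F(2, 10), F(4, 10), F(8, 10)]
means = prefix_means(ps)
bids = [1 / m for m in means]  # decreasing, since the running means increase

def observed_ctr(p_hat):
    """Average CTR of ads shown by the threshold rule b * p_hat >= 1."""
    shown = [p for p, b in zip(ps, bids) if b * p_hat >= 1]
    return sum(shown) / len(shown) if shown else None

# Every running mean is a fixed point: predicting it shows exactly that prefix.
fixed_points = [m for m in means if observed_ctr(m) == m]
print(fixed_points)
```

With $n = 4$ candidates this yields four non-degenerate fixed points, one per prefix; replicating the construction on $m$ independent queries multiplies the count to $n^m$.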

Q4: An example with no fixed points Consider a single query with two
candidates, $(p_1 = 0.7, b_1 = 4, z)$ and $(p_2 = 0.1, b_2 = 2, z)$, and let $\hat{p} = f(z)$
be the single prediction value. For any $\hat{p} \ge 0.5$,
both ads show and we observe a click-through rate of 0.4, so no such $\hat{p}$ can
be self-calibrated. For any $\hat{p} \in [0.25, 0.5)$, only ad 1 shows, and we observe a
click-through rate of 0.7. For $\hat{p} \in [0, 0.25)$, we don't show any ads. Thus, there
is no non-trivial fixed point; assuming we start with $\hat{p} \ge 0.25$, the calibration
procedure will cycle between 0.7 and 0.4.
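The cycling behaviour can be observed by iterating the calibration step directly. This is a minimal sketch: the function `T` mirrors the iterative procedure for this single-bucket example, and leaving the prediction unchanged when no ad shows is an assumption made for the illustration.

```python
from fractions import Fraction as F

# (p, b) for the two candidates in the example.
ads = [(F(7, 10), F(4)), (F(1, 10), F(2))]

def T(p_hat):
    """One calibration step: the observed CTR of the ads shown under p_hat.

    If no ad is shown, the prediction is left unchanged (assumption).
    """
    shown = [p for p, b in ads if b * p_hat >= 1]
    return sum(shown) / len(shown) if shown else p_hat

trajectory = [F(3, 10)]  # any start >= 0.25 ends up in the same cycle
for _ in range(6):
    trajectory.append(T(trajectory[-1]))

print([float(x) for x in trajectory])
```

Starting from 0.3, the iterates alternate between 0.7 and 0.4 and never settle.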

Q2: Calculating the efficiency-maximizing $f$ The above examples show
that self-calibrated prediction maps may not exist, and that even if they do, 
they need not maximize efficiency. 

Nevertheless, given access to the full problem data (including true click- 
through rates) one might be interested in calculating an efficiency maximizing 
prediction map. The following algorithm accomplishes this in polynomial time. 

We define $f^*$ by considering each $z' \in \{1, 2, \dots, K\}$ independently (we write
$\hat{p}$ for a predicted value to distinguish it from the true CTRs $p_i$):

1. Consider the set of candidates $(p, b, z, q)$ where $z = z'$, and sort these
candidates in decreasing order of bid, for $j = 1, \dots, n_{z'}$. We must show
some prefix of this list. In particular, if we set $\hat{p} = 1/b_j$ and $b_{j+1} < b_j$,
then we will show exactly ads $1, \dots, j$.

2. For each $j$ where $b_{j+1} < b_j$, compute the expected value per query of using
$\hat{p}_j = 1/b_j$ (which shows ads $1, \dots, j$). This can be computed as

$$\mathrm{EV}(\hat{p}_j) = \sum_{i=1}^{j} \Pr_Q(q_i)\,(p_i b_i - 1).$$

3. Let $f^*(z') = \hat{p}_{j^*}$, where $\hat{p}_{j^*}$ is the value that maximizes $\mathrm{EV}(\hat{p}_j)$.
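The three steps above can be sketched for a single raw prediction $z'$ as follows. The function name, the tuple layout `(p, b, q)`, and the assumption that all bids are at least 1 (so that $1/b_j$ is a valid prediction) are illustrative choices, not part of the paper.

```python
def best_threshold_prediction(ads, query_prob):
    """Pick the prediction 1/b_j that maximizes expected value under ALL.

    ads: list of (p, b, q) tuples sharing one raw prediction z'.
    query_prob: dict mapping query -> Pr_Q(query).
    Returns (prediction, expected value per query). Assumes all bids >= 1.
    """
    ordered = sorted(ads, key=lambda a: a[1], reverse=True)  # step 1
    best_p_hat, best_ev = None, None
    running = 0.0
    for j, (p, b, q) in enumerate(ordered):
        running += query_prob[q] * (p * b - 1)               # step 2, prefix sum
        is_last = j + 1 == len(ordered)
        # Only positions where the next bid is strictly lower define a prefix.
        if is_last or ordered[j + 1][1] < b:
            if best_ev is None or running > best_ev:
                best_p_hat, best_ev = 1 / b, running         # step 3
    return best_p_hat, best_ev

# Usage on the Q1 example with n = 3 (single query "q").
n = 3
example = [(0.5, 2.0, "q")] + [(1.0, 1.9, "q")] * n + [(0.0, 1.8, "q")] * n
p_hat, ev = best_threshold_prediction(example, {"q": 1.0})
print(p_hat, ev)
```

After sorting, the loop is a single pass, so the whole procedure runs in $O(n \log n)$ time per raw prediction. A caller may also want to compare the returned value with 0, the value of showing no ads at all.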

While this result is interesting theoretically (especially in contrast to results
in the next section), we note it is not likely to be useful in practice: if it were
possible to estimate $p_i$ accurately for each ad, then one could simply throw out
the coarser-grained predictions $z_i$ and use these estimates.

4 Mechanism ONE: Selecting One Ad 

In this section, we consider results for selection mechanism ONE. When there is 
only a single query, or only a single raw prediction, selection mechanism ONE can 
be quickly analyzed, and our questions are in fact answered in the affirmative, 
except for Q3. But in non-trivial cases, we again show negative answers to all 
four questions. 



Single query, multiple raw predictions Selection mechanism ONE becomes
rather degenerate under a single query. We show how to construct a nice $f$,
answering Q1, Q2, and Q4 in the affirmative.

For each raw prediction $z' \in \{1, \dots, K\}$, observe that if an ad with $z_i = z'$
shows, it must be an ad that has bid $b(z') = \max_{j : z_j = z'} b_j$. Thus, if an ad
with $z'$ shows, the expected value generated is $b(z') \cdot \mathbb{E}_C[p \mid z', b(z')]$, where
$\mathbb{E}_C[p \mid z', b(z')]$ is the average click-through rate of ads with $z = z'$, $b = b(z')$.
We can guarantee we obtain this value by simply setting $f(z') = \mathbb{E}_C[p \mid z', b(z')]$
and $f(z) = 0$ for all $z \neq z'$. Note that this $f$ is self-calibrated because ties are
broken uniformly at random under selection mechanism ONE, answering Q4 in
the affirmative. We obtain maximum efficiency by using the $f$ that only shows
ads with raw prediction

$$z^* = \arg\max_z \; b(z) \cdot \mathbb{E}_C[p \mid z, b(z)].$$

Let $f_z$ be the $f$ function that only shows candidates with the given $z$ value.
Thus, $f_{z^*}$ is nice. However, we can define a more satisfying $f^*$ by

$$f^*(z) = \mathbb{E}_{f_z}[p \mid z].$$

We only show ads $(b, z)$ where $b \cdot f^*(z)$ achieves the argmax value over candidates,
and in fact

$$b \cdot f^*(z) = b(z) \cdot \mathbb{E}_{f^*}[p \mid z],$$

and so we still maximize efficiency.

The answer to Q3 is negative: iterative calibration can have exponentially
many fixed points. Suppose each ad $i$ has a distinct $z_i$, and $p_i = b_i^{-1}$. Let $I$ be
any subset of the ads and define $f_I$ by $f_I(z_i) = p_i$ for $i \in I$, and $f_I(z_i) = 0$ for $i \notin I$.
Then, under $f_I$ all ads in $I$ tie, so we show them randomly. Each of the $2^{|\mathcal{C}|}$
subsets of $\mathcal{C}$ thus corresponds to a self-calibrated prediction map that shows a
different set of ads.
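The subset construction can be enumerated on a tiny instance. This sketch assumes three ads on a single query, each with its own raw prediction and $p_i = 1/b_i$; the numeric values are made up, and only nonempty subsets are enumerated (so $2^3 - 1 = 7$ maps).

```python
from fractions import Fraction as F
from itertools import combinations

# Raw prediction z -> (p, b), with p = 1 / b (illustrative values).
ads = {1: (F(1, 2), F(2)), 2: (F(1, 4), F(4)), 3: (F(1, 5), F(5))}

def shown_distribution(f):
    """Mechanism ONE, single query: uniform over ads with the top score b * f(z)."""
    scores = {z: b * f[z] for z, (p, b) in ads.items()}
    top = max(scores.values())
    winners = [z for z, s in scores.items() if s == top]
    return {z: F(1, len(winners)) for z in winners}

def is_self_calibrated(f):
    """With a distinct z per ad, E_f[p | z] is that ad's own p whenever it shows."""
    return all(f[z] == ads[z][0] for z in shown_distribution(f))

subsets = [s for r in range(1, len(ads) + 1) for s in combinations(ads, r)]
maps = [{z: (ads[z][0] if z in s else F(0)) for z in ads} for s in subsets]

for s, f in zip(subsets, maps):
    print(s, sorted(shown_distribution(f)), is_self_calibrated(f))
```

Each map scores the ads in its subset at exactly 1 and all others at 0, so the subset ties and is shown uniformly at random, and every such map is self-calibrated.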

Multiple queries, single raw prediction Under mechanism ONE, if there is
a single raw prediction $z$ made for all candidates (on all queries), then the ads
that show are in fact independent of the value $\hat{p} = f(z) > 0$: for each query,
we always randomly pick one of the candidates with the highest bid. Thus,
any $\hat{p} > 0$ is efficiency-maximizing, and we can choose $\hat{p}$ equal to the average
observed CTR to obtain self-calibration. Thus, in this case we answer Q1 through Q4
in the affirmative.

Q2: NP-hardness in general In general (with at least two distinct raw
predictions and at least two queries), under selection mechanism ONE, the offline
problem of finding the efficiency-maximizing prediction map $f$ is NP-hard, even
if all bids are 1. We show this using a reduction from the minimum feedback arc
set (MFAS) problem on tournaments (see, for example, Kleinberg et al. [14]).

In this problem, there are $n$ players, $\{1, \dots, n\}$, that have just completed a
tournament where every pair of players has played. The MFAS for this problem
is a ranking of the players that minimizes the number of upsets; that is, if $\mu_i$ is
the rank of player $i$, we want a ranking $\mu$ that minimizes the number of times
$\mu_i > \mu_j$, but player $j$ beat player $i$.

We encode this problem as an auction efficiency maximization problem as
follows: There are $n$ distinct $z$ values, $1, \dots, n$, one for each player, and there
are $\frac{1}{2}n(n - 1)$ queries (each equally likely), one for each pair $(i, j)$ with $i < j$.
The query for the pair $(i, j)$ (where $i$ beat $j$ without loss of generality) has two
candidates $(p, z)$, namely $(1, i)$ and $(0, j)$. Thus, if we show the ad corresponding
to the winner (with $z = i$), we have $p = 1$, and the bid is 1, so we get value 1;
if we show the ad with $z = j$, we have $p = 0$, and we get no value. It is then clear that
the efficiency-maximizing ranking of the raw predictions $z$ exactly corresponds
to the solution to the MFAS problem.
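The correspondence can be checked by brute force on a small tournament. This is a sketch only: the 4-player tournament in `beats` is hypothetical, the prediction map is taken to be strictly decreasing in rank position (so there are no ties), and exhaustive search is used purely for illustration.

```python
from itertools import permutations

def upsets(ranking, beats):
    """Count pairs where the loser is ranked above the winner."""
    pos = {player: r for r, player in enumerate(ranking)}  # 0 = top rank
    return sum(1 for (winner, loser) in beats if pos[loser] < pos[winner])

def auction_value(ranking, beats):
    """Expected value per query under ONE with all bids 1 and cost 0.

    f(z) is strictly decreasing in rank position, so each pair's query shows
    the ad of the higher-ranked player; the value is 1 iff that is the winner.
    """
    n = len(ranking)
    f = {player: (n - r) / n for r, player in enumerate(ranking)}
    wins = sum(1 for (winner, loser) in beats if f[winner] > f[loser])
    return wins / len(beats)

# Hypothetical tournament as (winner, loser) pairs; 1, 2, 3 form a cycle.
beats = [(1, 2), (2, 3), (3, 1), (1, 4), (2, 4), (3, 4)]

values = {r: auction_value(r, beats) for r in permutations([1, 2, 3, 4])}
best = max(values, key=values.get)
print(best, values[best], upsets(best, beats))
```

For every ranking the auction value equals one minus the fraction of upsets, so maximizing efficiency over strict rankings of the raw predictions is the same as minimizing upsets.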

Negative results for Q1, Q3, and Q4 in general We also show negative
results for Q1, Q3, and Q4 in general.

For Q1, observe that in the NP-hardness construction when there is a perfect
ranking, we observe a CTR of 1.0, and so the efficiency-maximizing prediction
map cannot be self-calibrated. We can illustrate this directly with the following
example. There are four ads on two queries $q_1$ and $q_2$, each given as a $(p, b, z)$ tuple:

        $q_1$                      $q_2$
    A  $(1.0, 2, z_1)$        C  $(1.0, 2, z_2)$
    B  $(0.0, 2, z_2)$        D  $(0.0, 1, z_1)$

We need $f(z_1) > f(z_2)$ in order to guarantee we show Ad A on $q_1$; we also need
$f(z_2) > \frac{1}{2} f(z_1)$ in order to show Ad C on $q_2$. We will observe a 1.0 CTR on
both $z_1$ and $z_2$ on any such efficiency maximizing $f$, but we are constrained to
pick $f(z_2) < f(z_1) \le 1$, and so no such $f$ can be self-calibrated.

For Q3, we have already shown multiple fixed points in the single-query
case. If we consider multiple queries, where each query has a single distinct raw
prediction, we immediately arrive at a problem with exponentially many fixed
points.

For Q4, it is straightforward to construct an example with cycles, but
constructing one with no fixed point is a bit trickier. In particular, any time there
is some prediction $z$ where each query has at least one ad with prediction $z$, we
can always find a fixed point by setting $f(z') = 0$ for $z' \neq z$ and $f(z) > 0$. The
set of ads shown will be independent of the non-zero value f(z), so we can set it 
equal to the observed CTR, achieving self-calibration (except in the degenerate 
case where all the ads with prediction z have zero CTR) . 

However, it is still possible to construct problems with no fixed points
without resorting to such degeneracy, as the following example illustrates. Each
query is equally likely, all the bids are 1, and the $(p, z)$ ad tuples are:

        $q_1$               $q_2$               $q_3$               $q_4$
    A $(0.5, z_1)$    B $(0.6, z_2)$    C $(0.5, z_1)$    E $(0.2, z_2)$
                                        D $(0.6, z_2)$    F $(0.3, z_1)$
Mechanism   Relationship                          Source
both        Prop E1 $\Rightarrow$ Prop E2         (immediate)
ALL         Prop E2 $\Leftrightarrow$ Prop SI     Thm 1
ALL         Prop E2 $\Rightarrow$ nice            Thm 2
ALL         Prop E1 $\Rightarrow$ nice            (from above)
ONE         Prop E1 $\Rightarrow$ nice            Thm 3
both        Prop SI $\not\Rightarrow$ Prop E1     Sec 5.2
ONE         Prop E2 $\not\Rightarrow$ Prop SI     Sec 5.2
ONE         Prop SI $\not\Rightarrow$ nice        Sec 5.2



Table 1: Relationships between problem properties. A "nice" problem instance 
is one where a self-calibrated efficiency-maximizing prediction map exists. 

If $f(z_1) > f(z_2)$, then we show ads A, B, C, and F. In this case, we observe a
CTR of $(0.5 + 0.5 + 0.3)/3 = 0.433$ for $z_1$ and 0.6 for $z_2$, so we cannot be
self-calibrated. If $f(z_1) < f(z_2)$, we show ads A, B, D, and E, and observe a CTR
of $(0.6 + 0.6 + 0.2)/3 = 0.467$ for $z_2$, and 0.5 for $z_1$, and so again we cannot be
self-calibrated. Finally, if $f(z_1) = f(z_2)$, we always show A and B, and show the
other ads half of the time. Thus, we observe a CTR of $(3/4)0.5 + (1/4)0.3 = 0.45$
for $z_1$, and a CTR of $(3/4)0.6 + (1/4)0.2 = 0.5$ for $z_2$, and so again we cannot
be self-calibrated. Thus, no self-calibrated $f$ exists for this problem.
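The three cases can be checked mechanically, since under ONE with unit bids the set of ads shown depends only on how $f(z_1)$ and $f(z_2)$ compare. The sketch below assumes the four-query layout of the example (A alone, B alone, C with D, E with F); the data structure and function name are illustrative.

```python
from fractions import Fraction as F

# query -> list of (name, p, z); all bids are 1 and queries are equally likely.
queries = {
    "q1": [("A", F(5, 10), 1)],
    "q2": [("B", F(6, 10), 2)],
    "q3": [("C", F(5, 10), 1), ("D", F(6, 10), 2)],
    "q4": [("E", F(2, 10), 2), ("F", F(3, 10), 1)],
}

def observed_ctrs(f):
    """Observed CTR per raw prediction under ONE, with uniform tie-breaking."""
    weight = {1: F(0), 2: F(0)}
    clicks = {1: F(0), 2: F(0)}
    for ads in queries.values():
        top = max(f[z] for _, _, z in ads)
        winners = [(p, z) for _, p, z in ads if f[z] == top]
        for p, z in winners:
            weight[z] += F(1, len(winners))
            clicks[z] += p * F(1, len(winners))
    return {z: clicks[z] / weight[z] for z in weight if weight[z] > 0}

gt = observed_ctrs({1: F(1, 2), 2: F(1, 4)})   # f(z1) > f(z2)
lt = observed_ctrs({1: F(1, 4), 2: F(1, 2)})   # f(z1) < f(z2)
eq = observed_ctrs({1: F(1, 2), 2: F(1, 2)})   # f(z1) = f(z2)
print(gt, lt, eq)
```

In each case the observed CTRs violate the ordering that produced them (for instance, $f(z_1) > f(z_2)$ yields 13/30 for $z_1$ and 3/5 for $z_2$), so no ordering can be self-consistent.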

5 Sufficient Conditions 

As the previous two sections show, without additional assumptions significant 
problems arise if one tries to achieve both calibration and auction efficiency. In 
this section, we introduce additional assumptions that are sufficient to guarantee 
nice prediction maps exist. Table 1 summarizes our results.

The intuition behind our results is a basic property of conditional probability.
Calibration depends on the conditional expectation $\mathbb{E}[p \mid z]$. In general, selection
changes the distribution this expectation is with respect to. But if selection is 
only a function of z, it does not change the conditional distribution of p given 
z, since the latter is already conditioned on z. 

For example, suppose we have a single query, and that all bids are 1, so 
all selection decisions are functions of z. This means that E[p | z] does not 
change under selection, and thus defines an efficiency-maximizing self-calibrated 
prediction map. To extend this intuition to more realistic auctions, we need to 
make sure that the query and the bid do not add any information about p, so 
that selection does not change E[p | z] and the different E[p | z] for each query 
can be reconciled. We now state these properties formally: 

Prop E1 For each $z$ there exists a value $p(z)$ such that for each query $q$ with
$\Pr_C(q \mid z) > 0$, and for each $b$ with $\Pr_C(b \mid q, z) > 0$,

$$\mathbb{E}_C[p \mid z, b, q] = \mathbb{E}_C[p \mid z, q] = \mathbb{E}_C[p \mid z] = p(z). \qquad (1)$$

That is, in all cases the bid and query provide no more information than the raw
prediction about average click-through rates (see footnote 4). For this assumption, the natural
prediction map to consider is $f(z) = p(z)$. □



Prop E2 A weaker assumption is that

$$\mathbb{E}_C[p \mid z, b] = \mathbb{E}_C[p \mid z] \qquad (2)$$

whenever both expectations are defined. This essentially marginalizes over
queries, rather than holding simultaneously for all $q$. □

Prop SI A problem instance is selection-invariant if for all $f$, $f'$, and for any $z$
where both $\mathbb{E}_f[p \mid z]$ and $\mathbb{E}_{f'}[p \mid z]$ are defined, we have

$$\mathbb{E}_f[p \mid z] = \mathbb{E}_{f'}[p \mid z]. \qquad (3)$$

Selection invariance says that the observed CTR for a given raw prediction $z$ is
independent of the prediction map used for selection. Under this assumption,
the natural calibration map to consider is $f^*(z) = \mathbb{E}_{f_z}[p \mid z]$, where $f_z$ is any
prediction map that shows some ads with raw prediction $z$. □

It is easy to show that Prop E1 implies Prop E2.

A weak per-query variant of Prop E1 is that, for all $z$, $b$, and $q$ (when defined),
$\mathbb{E}_C[p \mid z, b, q] = \mathbb{E}_C[p \mid z, q]$. We can dismiss this assumption as insufficient,
as we can take the negative examples of Section 3 and re-state them where
each candidate occurs on a distinct query, each equally likely. Thus, the above
property holds trivially, but the pathological behaviors still occur.

5.1 Properties that Imply Nice Maps Exist 

First, we show that under mechanism ALL, Prop E2 and Prop SI are equivalent;
we then show that Prop E2 (and hence also Prop SI) implies a nice problem.

Theorem 1. Under selection mechanism ALL, Prop E2 is equivalent to Prop SI 
(selection invariance). 

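Proof sketch. Suppose Prop E2 holds. Selection mechanism ALL must show
either all of the candidates with a given $(z, b)$ combination, or none of them.
Thus, for any $f$ where $\Pr_f(z, b) > 0$, we must have

$$\mathbb{E}_f[p \mid z, b] = \mathbb{E}_C[p \mid z, b]. \qquad (4)$$

Then, for any $f$, assuming $\mathbb{E}_f[p \mid z]$ is defined,

$$\begin{aligned}
\mathbb{E}_f[p \mid z] &= \mathbb{E}_f\big[\mathbb{E}_f[p \mid z, b]\big] \\
&= \mathbb{E}_f\big[\mathbb{E}_C[p \mid z, b]\big] && \text{Eq. (4)} \\
&= \mathbb{E}_f\big[\mathbb{E}_C[p \mid z]\big] && \text{Prop E2} \\
&= \mathbb{E}_C[p \mid z],
\end{aligned}$$

where the outer expectation is over the bids $b$ of the ads shown with raw prediction $z$.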

4 Note that this does not hold under the NP-hardness reduction for ONE in the previous
section, as $\mathbb{E}_C[p \mid z, q] \neq \mathbb{E}_C[p \mid z]$.
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For the other direction, suppose we have selection invariance (Prop Si). It is 
sufficient to consider a fixed raw prediction z (if there are multiple z, we can 
consider them independently). Also, we can assume candidates have distinct 
bids - if multiple candidates have the same bid and raw prediction, mechanism 
ALL treats them all the same, so we can just average over them. 

Index the bids (b_1, b_2, ...) in decreasing order. Then, depending on the
chosen p = f(z), we either show (when the appropriate queries occur) ad 1, or
ads 1 and 2, etc. Prop SI says that no matter what p is, the average CTR of
the ads we show is the same. Suppose that all the ads are on the same query.
Then Prop SI implies p_1 = (1/2) p_1 + (1/2) p_2, so p_1 = p_2; next,
(1/2)(p_1 + p_2) = (1/3)(p_1 + p_2 + p_3), so p_1 = p_2 = p_3; and so on.
When the ads are on different queries, the weights in the above equalities
change to reflect the query distribution, but are still all positive and sum
to 1, so the same inductive reasoning holds. □
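
As a sanity check on the inductive argument, the following is a minimal Python sketch. It assumes all ads sit on a single query and are already sorted by decreasing bid; the helper names `prefix_averages` and `selection_invariant` are illustrative and do not come from the paper. It shows that the average CTR of the top-k ads is the same for every k only when all CTRs are equal.

```python
from fractions import Fraction


def prefix_averages(ctrs):
    """Average CTR of the top-k ads (sorted by decreasing bid), for k = 1..n."""
    total = Fraction(0)
    averages = []
    for k, p in enumerate(ctrs, start=1):
        total += p
        averages.append(total / k)
    return averages


def selection_invariant(ctrs):
    """True if every prefix of the ads has the same average CTR."""
    return len(set(prefix_averages(ctrs))) <= 1


equal = [Fraction(3, 10)] * 4
# The first and last prefix averages agree here, but the third does not.
unequal = [Fraction(3, 10), Fraction(3, 10), Fraction(1, 10), Fraction(5, 10)]

print(selection_invariant(equal))
print(selection_invariant(unequal))
```

The second list is chosen so that the overall average matches the first CTR; invariance still fails because an intermediate prefix has a different average.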

This result implies that under selection mechanism ALL, when Prop E2 holds
the prediction map f*(z) = E_C[p | z] is self-calibrated. Next, we show this map
is in fact also efficiency-maximizing:

Theorem 2. Under selection mechanism ALL, Prop E2 implies f* is efficiency
maximizing, where f*(z) = E_C[p | z].

Proof. Recall we need to show f* maximizes

    EV(f) = \sum_{i \in C} Pr_Q(q_i) Pr(i | q_i, f) (p_i b_i - 1).    (5)

Since selection decisions for one z value do not impact others, it suffices to
consider a single z value. We can decompose the sum over C over the partition
that associates all the ads that share a common bid and raw prediction. Let
B = {i | b_i = b, z_i = z} \subseteq C be the element of this partition for (b, z). For a given
f(z) = p, either all the ads in B show (when their respective queries occur), or
none of them do; thus, if we can show that f* shows these ads if and only if
they increase EV, we are done. The expected value per query of showing these
ads is:

    \sum_{i \in B} Pr_Q(q_i) Pr(i | q_i, f) (p_i b_i - 1).

Since Pr(i | q_i, f) \in {0, 1} must be the same for all these ads, this quantity is
non-negative if and only if \sum_{i \in B} Pr_Q(q_i)(p_i b_i - 1) \ge 0.

Recall Pr_C(i) = Pr_Q(q_i)/C where C = \sum_{i \in C} Pr_Q(q_i). We have
Pr_C(i \wedge b \wedge z) = Pr_C(i) if i \in B, and 0 otherwise. Letting
C_B = \sum_{i \in B} Pr_Q(q_i), then Pr_C(b \wedge z) = C_B/C and so

    Pr_C(i | b, z) = Pr_C(i) / (C_B/C) = (Pr_Q(q_i)/C) / (C_B/C) = Pr_Q(q_i) / C_B    (6)

for i \in B, and 0 otherwise. Then,

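    E_C[p | z] = E_C[p | b, z]                               (Prop E2)
               = \sum_{i \in C} Pr_C(i | b, z) p_i
               = (1/C_B) \sum_{i \in B} Pr_Q(q_i) p_i.        (Eq. (6))

Using this result, we have

    \sum_{i \in B} Pr_Q(q_i)(p_i b - 1) = b (\sum_{i \in B} Pr_Q(q_i) p_i) - C_B
                                        = C_B (b E_C[p | z] - 1).

This quantity is non-negative if and only if b f*(z) - 1 \ge 0; since this is exactly
the condition we use to decide whether or not to show the ads in B, we are
done. □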
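
The argument can be checked by brute force on a small instance. The sketch below is illustrative only: the five ads, their query probabilities, and the function names are hypothetical, and mechanism ALL is assumed to show every ad with b * f(z) - 1 >= 0. All ads share one raw prediction z, and each bid block has the same average CTR (0.3), so Prop E2 holds.

```python
from fractions import Fraction as F

# Hypothetical ads sharing one raw prediction z: (CTR p, bid b, query probability).
# Each bid block has mean CTR 3/10, so E_C[p | b, z] = E_C[p | z] (Prop E2).
ADS = [
    (F(2, 10), 4, F(1, 5)), (F(4, 10), 4, F(1, 5)),
    (F(3, 10), 2, F(1, 5)),
    (F(1, 10), 1, F(1, 5)), (F(5, 10), 1, F(1, 5)),
]


def calibrated_prediction(ads):
    """f*(z) = E_C[p | z]: query-probability-weighted mean CTR."""
    return sum(w * p for p, _, w in ads) / sum(w for _, _, w in ads)


def expected_value(ads, x):
    """EV under ALL when f(z) = x (assumed rule: show ads with b * x - 1 >= 0)."""
    return sum(w * (p * b - 1) for p, b, w in ads if b * x - 1 >= 0)


f_star = calibrated_prediction(ADS)
grid = [F(k, 100) for k in range(101)]  # candidate predictions 0.00 .. 1.00
best = max(expected_value(ADS, x) for x in grid)

print(f_star, expected_value(ADS, f_star), best)
```

On this instance the calibrated prediction attains the best expected value found on the grid, which is what Theorem 2 predicts when Prop E2 holds.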

It is not hard to directly prove that under selection mechanism ALL, Prop SI
implies f* is efficiency-maximizing: the idea is to consider again a single z, sort
the ads by bid into blocks, and show by induction that each block has average
CTR f*(z).

In Section 4 we saw that the problem of finding an efficiency-maximizing f
is NP-hard under mechanism ONE, even under the assumption of a single bid.
Under Prop E1, fortunately, the situation is much easier:

Theorem 3. Under selection mechanism ONE, if Prop E1 holds then the prediction
map f* where f*(z) = E_C[p | z] is efficiency-maximizing and self-calibrated.

Proof. For a query q, consider a partition \mathcal{B}^q of C(q) into sets of ads that share
a common b and z, so the elements of the partition are

    B^q_{b,z} = {i | b_i = b, z_i = z, q_i = q} \subseteq C(q)

for each (b, z) pair.

All i \in B for some B must share a common value Pr_f(i | q). We also use B
as the event that some i \in B shows; so for example Pr_f(B | q) is the probability
that some ad from B shows. Under selection mechanism ONE, for each i \in B,
we have Pr_f(i | B, q) = 1/|B| (since ties are broken at random). Also,

    E_C[p | b, z, q] = (1/|B^q_{b,z}|) \sum_{i \in B^q_{b,z}} p_i.    (7)

Recalling cost is zero under ONE, for any f,
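    EV(f) = \sum_q Pr_Q(q) \sum_{i \in C(q)} Pr_f(i | q) p_i b_i
          = \sum_q Pr_Q(q) \sum_{B \in \mathcal{B}^q} \sum_{i \in B} Pr_f(i | q) p_i b_i
          = \sum_q Pr_Q(q) \sum_{B^q_{b,z}} Pr_f(B^q_{b,z} | q) b (1/|B^q_{b,z}|) \sum_{i \in B^q_{b,z}} p_i

and using Eq. (7),

          = \sum_q Pr_Q(q) \sum_{B^q_{b,z}} Pr_f(B^q_{b,z} | q) b E_C[p | b, z, q]
          \le \sum_q Pr_Q(q) max_{b,z} b E_C[p | b, z, q].

Thus, it is sufficient to show that selecting ads using f* produces the expected
value in the last line of the above inequality. For each query, we rank the ads
using b f*(z) = b E_C[p | b, z, q] (the equality is Prop E1), and so this is exactly
the expected value that f* obtains.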

To see that f* is self-calibrated, observe that when Pr_f(z, b, q) > 0,

    E_f[p | z, b, q] = E_C[p | z, b, q] = f*(z),

and so

    E_f[p | z] = \sum_{b,q} Pr_f(b, q | z) E_f[p | z, b, q] = f*(z).

□
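
A small numerical check of Theorem 3, written as a sketch under stated assumptions: the instance below is hypothetical, mechanism ONE is assumed to show the single ad with the highest b * f(z) on each query with ties split evenly, and the function names are illustrative. Every (z, b, q) block has mean CTR 0.2 for z1 and 0.4 for z2, so Prop E1 holds.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical instance: (query, CTR p, bid b, raw prediction z).
ADS = [("q1", F(1, 10), 3, "z1"), ("q1", F(3, 10), 3, "z1"), ("q1", F(4, 10), 1, "z2"),
       ("q2", F(2, 10), 1, "z1"), ("q2", F(4, 10), 2, "z2")]
QUERY_PROB = {"q1": F(1, 2), "q2": F(1, 2)}


def show_probs(ads, f):
    """Assumed ONE rule: per query, the top b * f(z) ad shows; ties split evenly."""
    probs = []
    for q, _, b, z in ads:
        scores = [bb * f[zz] for qq, _, bb, zz in ads if qq == q]
        top = max(scores)
        probs.append(F(1, scores.count(top)) if b * f[z] == top else F(0))
    return probs


def expected_value(ads, f):
    """EV(f) = sum_i Pr_Q(q_i) Pr_f(i | q_i) p_i b_i (cost is zero under ONE)."""
    return sum(QUERY_PROB[q] * s * p * b
               for (q, p, b, _), s in zip(ads, show_probs(ads, f)))


def shown_ctr(ads, f, z):
    """E_f[p | z]: average CTR of the shown ads with raw prediction z."""
    rows = [(QUERY_PROB[q] * s, p)
            for (q, p, _, zz), s in zip(ads, show_probs(ads, f)) if zz == z]
    return sum(w * p for w, p in rows) / sum(w for w, _ in rows)


f_star = {"z1": F(2, 10), "z2": F(4, 10)}  # f*(z) = E_C[p | z]
grid = [F(k, 10) for k in range(11)]
best = max(expected_value(ADS, {"z1": a, "z2": c}) for a, c in product(grid, grid))

print(expected_value(ADS, f_star), best)
print(shown_ctr(ADS, f_star, "z1"), shown_ctr(ADS, f_star, "z2"))
```

On this instance no prediction map on the grid beats f*, and the CTR of the ads shown under f* equals f*(z) for both raw predictions.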

5.2 Negative Results 

We show several negative results relating to the assumptions considered in the 
previous section. 

ONE and ALL: Prop SI does not imply Prop E1. Consider an example with
two queries, each equally likely. Each query has two candidates, given as the
following (p, b) tuples (they all share a common z):

    q_1            q_2
    A (0.1, 1)     C (0.1, 2)
    B (0.2, 2)     D (0.2, 1)

Because of the symmetry between these queries, under any f (and either
selection mechanism), ad A must show with the same probability as ad D, as
must ads B and C. Thus, for any f, E_f[p | b = 1, z] = 0.15, and similarly
E_f[p | b = 2, z] = 0.15. Thus, selection invariance holds, as does Prop E2.
However, E_C[p | z, b = 1, q_1] = 0.1 \ne E_C[p | z, q_1] = 0.15.
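
The numbers in this example can be reproduced with a short Python sketch. The selection rules are assumptions made for illustration: ALL shows every ad with b * f(z) >= 1, and ONE (with a shared z and f(z) > 0) shows the highest-bid ad on each query. Function names are hypothetical.

```python
from fractions import Fraction as F

# (name, query, CTR p, bid b); both queries equally likely, one shared z.
ADS = [("A", "q1", F(1, 10), 1), ("B", "q1", F(2, 10), 2),
       ("C", "q2", F(1, 10), 2), ("D", "q2", F(2, 10), 1)]


def mean_ctr(ads):
    """Average CTR of a non-empty list of equally weighted ads."""
    return sum(ad[2] for ad in ads) / len(ads)


def shown_under_all(ads, x):
    """Assumed ALL rule with f(z) = x: show every ad with b * x >= 1."""
    return [ad for ad in ads if ad[3] * x >= 1]


def shown_under_one(ads):
    """Assumed ONE rule with a shared z and f(z) > 0: top bid per query shows."""
    queries = sorted({ad[1] for ad in ads})
    return [max((ad for ad in ads if ad[1] == q), key=lambda ad: ad[3])
            for q in queries]


# Prop E2: the CTR averaged within each bid level equals the overall average.
by_bid = {b: mean_ctr([ad for ad in ADS if ad[3] == b]) for b in (1, 2)}

# Prop E1 fails: conditioning on q1 and b = 1 changes the average.
q1_ads = [ad for ad in ADS if ad[1] == "q1"]
e_q1 = mean_ctr(q1_ads)
e_q1_b1 = mean_ctr([ad for ad in q1_ads if ad[3] == 1])

print(by_bid, e_q1, e_q1_b1)
print(mean_ctr(shown_under_all(ADS, F(1, 2))), mean_ctr(shown_under_all(ADS, F(1))))
print(mean_ctr(shown_under_one(ADS)))
```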

ONE: Prop E2 does not imply Prop SI. Consider the example, with two
equally likely queries, and two distinct raw predictions, given as (p, b, z) tuples:

    q_1                  q_2
    A (0.2, 2, z_1)      C (0.1, 2, z_1)
    B (0.1, 1, z_1)      D (0.2, 1, z_1)
    E (1.0, 9, z_2)

Note that E_C[p | z_1, b = 1] = E_C[p | z_1, b = 2] = 0.15. However, if we consider
two prediction maps f(z_1) = 0.5, f(z_2) = 1 and f'(z_1) = 1, f'(z_2) = 0, under
selection mechanism ONE, we have E_f[p | z_1] = 0.1, but E_{f'}[p | z_1] = 0.15.
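
The two conditional CTRs quoted above can be recomputed with the sketch below. It rests on two assumptions that should be read as such: ad E sits on the first query with the second raw prediction z2 (the reading consistent with "two distinct raw predictions" and with the stated values), and mechanism ONE shows the single ad with the highest b * f(z) per query, splitting ties evenly. Function names are illustrative.

```python
from fractions import Fraction as F

# (name, query, CTR p, bid b, raw prediction z); queries equally likely.
# Assumption: ad E is on q1 and carries raw prediction z2.
ADS = [("A", "q1", F(2, 10), 2, "z1"), ("B", "q1", F(1, 10), 1, "z1"),
       ("E", "q1", F(1), 9, "z2"),
       ("C", "q2", F(1, 10), 2, "z1"), ("D", "q2", F(2, 10), 1, "z1")]


def show_probs(ads, f):
    """Assumed ONE rule: per query, the top b * f(z) ad shows; ties split evenly."""
    probs = {}
    for name, q, _, b, z in ads:
        scores = [bb * f[zz] for _, qq, _, bb, zz in ads if qq == q]
        top = max(scores)
        probs[name] = F(1, scores.count(top)) if b * f[z] == top else F(0)
    return probs


def shown_ctr(ads, f, z):
    """E_f[p | z]; queries are equally likely, so query weights cancel."""
    probs = show_probs(ads, f)
    rows = [(probs[name], p) for name, _, p, _, zz in ads if zz == z]
    return sum(w * p for w, p in rows) / sum(w for w, _ in rows)


f = {"z1": F(1, 2), "z2": F(1)}
f_prime = {"z1": F(1), "z2": F(0)}

print(shown_ctr(ADS, f, "z1"), shown_ctr(ADS, f_prime, "z1"))
```

Under f, ad E wins the first query, so only C is shown among the z1 ads; under f', ads A and C are shown.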

ONE: Prop SI does not imply a nice problem. We have four queries, each
equally likely; the bids for the ads on q_3 and q_4 are defined in terms of some
small \epsilon > 0, with (p, b, z) tuples:

    q_1              q_2              q_3                       q_4
    A (1, 2, z_1)    C (1, 2, z_2)    A' (0, 2\epsilon, z_1)    C' (0, 2\epsilon, z_2)
    B (0, 2, z_2)    D (0, 1, z_1)    B' (1, 2\epsilon, z_2)    D' (1, 1\epsilon, z_1)

Note that q_3 and q_4 mirror q_1 and q_2, except that the bids are scaled by \epsilon,
and the CTRs are reversed. Under any f, ads A and A' show with the same
probability, as do B and B', and the other two pairs. Thus, under selection by
any f, we have E_f[p | z_1] = E_f[p | z_2] = 0.5 whenever the expectation is defined,
and so Prop SI holds. However, as \epsilon \to 0, only q_1 and q_2 have any impact
on efficiency. Thus, as before we have constraints on the optimal solution that
f(z_1) > f(z_2) > (1/2) f(z_1). Thus, the prediction map f* with f*(z_1) = 0.5 and
f*(z_2) = 0.5 is not efficiency-maximizing, as it shows ad A on q_1 only half
the time.
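
To make the efficiency gap concrete, here is a minimal Python sketch of this example. The value of epsilon, the alternative map `g`, and the function names are hypothetical choices; mechanism ONE is assumed to show the single ad with the highest b * f(z) per query with ties split evenly, at zero cost.

```python
from fractions import Fraction as F

EPS = F(1, 100)  # hypothetical small epsilon

# (name, query, CTR p, bid b, raw prediction z); four equally likely queries.
ADS = [("A", "q1", 1, F(2), "z1"), ("B", "q1", 0, F(2), "z2"),
       ("C", "q2", 1, F(2), "z2"), ("D", "q2", 0, F(1), "z1"),
       ("A'", "q3", 0, 2 * EPS, "z1"), ("B'", "q3", 1, 2 * EPS, "z2"),
       ("C'", "q4", 0, 2 * EPS, "z2"), ("D'", "q4", 1, EPS, "z1")]
QUERY_PROB = F(1, 4)


def show_probs(ads, f):
    """Assumed ONE rule: per query, the top b * f(z) ad shows; ties split evenly."""
    probs = {}
    for name, q, _, b, z in ads:
        scores = [bb * f[zz] for _, qq, _, bb, zz in ads if qq == q]
        top = max(scores)
        probs[name] = F(1, scores.count(top)) if b * f[z] == top else F(0)
    return probs


def expected_value(ads, f):
    """EV(f) = sum_i Pr_Q(q_i) Pr_f(i | q_i) p_i b_i (cost is zero under ONE)."""
    probs = show_probs(ads, f)
    return sum(QUERY_PROB * probs[name] * p * b for name, _, p, b, _ in ads)


def shown_ctr(ads, f, z):
    """E_f[p | z]: average CTR of the shown ads with raw prediction z."""
    probs = show_probs(ads, f)
    rows = [(probs[name], p) for name, _, p, _, zz in ads if zz == z]
    return sum(w * p for w, p in rows) / sum(w for w, _ in rows)


f_star = {"z1": F(1, 2), "z2": F(1, 2)}
g = {"z1": F(3, 5), "z2": F(2, 5)}  # satisfies g(z1) > g(z2) > g(z1) / 2

print(expected_value(ADS, f_star), expected_value(ADS, g))
print([shown_ctr(ADS, m, z) for m in (f_star, g) for z in ("z1", "z2")])
```

Both maps give a shown-ad CTR of 1/2 for each raw prediction, yet the map `g` obtains a strictly higher expected value than f* for this epsilon.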

6 Discussion and Future Work 

Our sufficient conditions are quite strong, but not unrealistic. They require that 
the bid and query not add any information about the CTR, conditional on the 
raw prediction. CTR estimation systems normally use queries as features (e.g., 
(HI), so it is reasonable to hope that the query does not add extra information. 
Bids are set by advertisers for query-ad pairs, which are already used by CTR 
estimation systems, so any systematic patterns in bids are likely to be accounted
for. Since advertisers have much less information than the auctioneer, it seems
unlikely that they can add extra information about CTRs through fine-grained
bid manipulation. We can test if our sufficient conditions hold by running
randomization experiments that change the mix of ads shown.

Since randomized predictions cannot in general lead to maximum efficiency,
it is natural to first consider deterministic prediction maps. Nevertheless, given
the negative results in the current work, it would be interesting to also study
randomized calibration strategies that provide calibration guarantees without
needing IID assumptions. Then the natural question becomes: how much
efficiency is lost by using a randomized calibration strategy, versus using a
deterministic efficiency-maximizing prediction map that is not self-calibrated?